Part 0 - Completed
Part 1 - Completed
Part 2 - Completed
Part 3 - Completed
Part 4 - Completed
What do you think is the average revenue of your favourite YouTuber? Explain briefly how you arrived at this answer and what your approach to the question was.
Solution:
For this part I have used the World Affairs YouTube channel.
I have used the following method to estimate the projected revenue for a YouTuber with 1 million daily views and a CPM of $1.50:
Estimated Daily Earnings = (Number of Daily Views / 1,000) x CPM
Entering the variables for a YouTuber with 1 million daily views and a CPM of $1.50, the calculation looks like this:
Estimated Daily Earnings = (1,000,000 / 1,000) x $1.50 = $1,500
Next, we can estimate the YouTuber's monthly earnings. To do so, we simply multiply the estimated daily earnings by 30:
Estimated Monthly Earnings = Estimated Daily Earnings x 30
Estimated Monthly Earnings = $1,500 x 30 = $45,000
Finally, to calculate the projected yearly earnings, we multiply the estimated monthly earnings by 12:
Projected Yearly Earnings = Estimated Monthly Earnings x 12
Projected Yearly Earnings = $45,000 x 12 = $540,000
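The arithmetic above can be checked with a short Python sketch. The 1 million daily views and $1.50 CPM are the assumed inputs from the text; actual channel CPMs vary widely by niche and audience.

```python
# Revenue estimate for a channel with 1M daily views and a $1.50 CPM (assumed inputs)
daily_views = 1_000_000
cpm = 1.50  # dollars earned per 1,000 monetized views

daily = (daily_views / 1000) * cpm    # estimated daily earnings
monthly = daily * 30                  # estimated monthly earnings
yearly = monthly * 12                 # projected yearly earnings

print(daily, monthly, yearly)  # 1500.0 45000.0 540000.0
```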
import pandas as pd
# Load the Excel file into a pandas dataframe
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/assignment.xlsx', sheet_name='Funnel')
# View the first five rows of the dataframe
print(df.head())
   Year  Month  Segment Region           KPI Value Type    Value
0  2020     12  Clients  India  Lv1_Visitors    Actuals  3665558
1  2020     12  Clients  India  Lv2_Visitors    Actuals  2689569
2  2020     12  Clients  India  Lv3_Visitors    Actuals  1300571
3  2020     12  Clients  India  Lv4_Visitors    Actuals   717608
4  2020     12  Clients  India  Lv3_Visitors    Actuals   706677
What type of company does this dataset belong to?
Solution:
Based on the information in the dataset, it appears to be a funnel-analysis dataset belonging to a business in the e-commerce sector.
The dataset captures visitor counts at the various phases of the customer journey, from the first website visit through the final purchase stage, along with conversion rates and the revenue produced.
Problem 2: Suppose that this dataset is for a website like Flipkart, what could be the possible definitions of the columns Level(visitors) 1, 2, 3, 4 and 5 in the given dataset? Do you observe any pattern?
Solution:
The possible definitions of the columns "Level (Visitors) 1, 2, 3, 4, and 5" could reflect the various stages of the customer journey if we suppose that this dataset is for a website like Flipkart.
Level 1 might correspond to site visitors who first arrived,
Level 2 to those who browsed particular product categories,
Level 3 to those who clicked on a particular product,
Level 4 to those who added items to their shopping cart, and
Level 5 to those who completed the checkout process.
Trend
From the data, we can see that the number of visitors gradually declines as we move to higher levels, which is exactly what a funnel analysis predicts. We can also see that the conversion rate tends to rise at the higher levels, showing that users who are more engaged with the website are more inclined to buy. Finally, revenue rises sharply toward the final stages of the funnel, which highlights the importance of optimizing the website to maximize conversion rate and revenue.
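To make the drop-off concrete, a minimal sketch with made-up level counts (not taken from the dataset) computes the stage-to-stage conversion rates:

```python
import pandas as pd

# Hypothetical level counts for one month, in the same shape as the dataset's KPI values
funnel = pd.Series({
    'Lv1_Visitors': 1_000_000,
    'Lv2_Visitors': 700_000,
    'Lv3_Visitors': 400_000,
    'Lv4_Visitors': 180_000,
    'Lv5_Visitors': 60_000,
})

# Conversion rate from each level to the next (Lv2/Lv1, Lv3/Lv2, ...)
conversion = (funnel / funnel.shift(1)).dropna().round(3)
print(conversion)
```

Each entry shows what fraction of visitors survives to the next stage, so a small improvement at an early level compounds through the whole funnel.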
What was the total number of visitors segmented by each level, every month in each year?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Read the dataset into a pandas dataframe
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/assignment.xlsx', sheet_name='Funnel')
df.head()
df.Value
0 3665558
1 2689569
2 1300571
3 717608
4 706677
...
1567 1693
1568 1428
1569 1311
1570 1071
1571 527
Name: Value, Length: 1572, dtype: int64
# Create a pivot table to show the total number of visitors segmented by each level, every month in each year
pivot = pd.pivot_table(df, values='Value', index=['Year', 'Month'], columns=['Segment', 'KPI'], aggfunc=np.sum)
# View the pivot table
print(pivot)
Segment Clients \
KPI Lv1_Visitors Lv2_Visitors Lv3_Visitors Lv4_Visitors Lv5_Visitors
Year Month
2020 1 3322789 2304318 2410633 1157186 363939
2 2976450 2021915 2101659 1011351 318689
3 3453194 2334668 2440362 1186822 380754
4 4583672 3157506 3309406 1750651 555127
5 4703447 3333479 3393510 38408015 535507
6 4035885 2797449 2634588 1181799 415727
7 3586692 2461470 2225206 954541 335742
8 3133000 2125630 1876039 790587 270832
9 3377895 2296989 2052849 897110 311885
10 4286063 2923280 2720179 1147036 394290
11 3634761 2510682 2097387 938356 353697
12 3665558 2689569 2007248 897733 330491
2021 1 3709579 2737312 2069173 929905 334932
2 3067802 2127445 1673076 732142 260215
3 3048799 1956176 1646093 739998 257440
4 3057604 1953667 1567928 714884 243135
5 3278319 2111235 1709112 782186 275921
6 2790598 1770074 1466886 642765 209801
7 2883471 1885690 1530813 730577 243400
8 3235262 2071017 1699258 813048 265349
9 3017046 1942594 1623241 752299 260747
10 3223253 2025461 1706380 780300 280499
11 3665788 2261094 1843438 887883 321797
12 3543797 2199420 1782771 813245 285523
2022 1 5149212 3201562 2632792 1229833 459258
2 4614615 2826180 2301555 1076383 412729
3 4689717 2831029 2295964 948519 361298
4 4362681 2583815 2085700 956055 370116
5 4575236 2861299 2048233 909203 343327
6 4545397 2886077 2140205 932454 340917
7 4794170 3021946 2158995 1013750 365277
8 4602134 2851490 2151168 926916 341027
9 4522456 2790377 2287076 915374 326635
10 4705840 2894596 2227150 998514 375938
11 4857535 3089427 2352993 1118281 427054
12 4392558 2819845 2358706 988546 107913
Segment Customers
KPI Lv1_Visitors Lv2_Visitors Lv3_Visitors Lv4_Visitors Lv5_Visitors
Year Month
2020 1 605252 460953 555356 235096 68562
2 634726 488635 610242 260685 70144
3 700476 549325 696667 303831 83111
4 952301 752792 887839 397094 108386
5 957269 784931 974202 439014 117812
6 894628 734666 880345 380616 103971
7 801056 651910 773498 322480 86563
8 717727 576340 659037 270785 70552
9 781782 630075 729504 309821 85514
10 1014962 828265 965466 417155 116507
11 879507 712544 783797 342597 98855
12 780674 615366 675758 295589 82591
2021 1 889920 689741 739216 335905 96722
2 910358 725211 738563 349309 108845
3 1036446 847132 842369 396563 128086
4 836882 669841 700013 298319 84360
5 781835 623508 667727 268880 66645
6 721998 566659 599481 239613 57499
7 700108 552001 564444 231070 53886
8 659526 492179 461748 171445 35250
9 635646 452915 427475 143734 31027
10 616577 415019 363960 105240 27727
11 578423 396411 316443 115353 31779
12 629048 434953 311943 123228 29080
2022 1 637719 410887 317661 117952 31744
2 595216 392437 311249 130017 32760
3 802788 536075 346418 126856 31683
4 646645 439196 295395 106921 27377
5 701577 458601 274222 83972 23640
6 724643 445181 260741 70733 20426
7 649114 434245 269701 85301 21294
8 634380 435715 280149 82156 23418
9 588864 411400 276388 80989 22965
10 595663 418393 296815 94824 24425
11 619135 444157 331189 109141 30289
12 587719 400044 286948 92106 7303
# Now, making the table more visualized so that it is easy to understand
# Create a heatmap to visualize the pivot table
plt.figure(figsize=(15, 8))
sns.heatmap(pivot, cmap='YlGnBu', annot=True, fmt=',.0f')
plt.title('Total Number of Visitors Segmented by Each Level, Every Month in Each Year')
plt.show()
What is the percentage difference in the number of visitors between different regions and years?
# To calculate the percentage difference we will use the pivot table
pivot2 = pd.pivot_table(df, values='Value', index=['Year', 'Region'], columns='Segment', aggfunc='sum')
pivot2['Total'] = pivot2.sum(axis=1)
pivot2['Percent Diff'] = pivot2.groupby(['Region']).Total.pct_change()*100
# Create a bar plot to visualize the pivot table to easily understand the data
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
pivot2['Percent Diff'].plot(kind='bar', ax=ax[0], title='Percent Difference (Before)')
pivot2.dropna()['Percent Diff'].plot(kind='bar', ax=ax[1], title='Percent Difference (After)')
plt.tight_layout()
plt.show()
# Create a pivot table to aggregate the 'Value' column by the different columns
pivot_table = pd.pivot_table(df, index=['Year', 'Month', 'Segment', 'Region', 'KPI', 'Value Type'], values=['Value'], aggfunc=[sum, len])
# Print the pivot table
print(pivot_table)
sum len
Value Value
Year Month Segment Region KPI Value Type
2020 1 Clients India Lv1_Visitors Actuals 3322789 1
Lv2_Visitors Actuals 2304318 1
Lv3_Visitors Actuals 2410633 2
Lv4_Visitors Actuals 1157186 2
Lv5_Visitors Actuals 363939 2
... ... ...
2022 12 Customers Ujjain Lv1_Visitors Actuals 18327 1
Lv2_Visitors Actuals 11600 1
Lv3_Visitors Actuals 6821 1
Lv4_Visitors Actuals 2556 1
Lv5_Visitors Actuals 293 1
[1500 rows x 2 columns]
# Calculate the summary statistics for the 'Value' column
value_stats = df['Value'].describe()
print(value_stats)
count    1.572000e+03
mean     2.964329e+05
std      1.130364e+06
min      1.370000e+02
25%      9.247750e+03
50%      3.257450e+04
75%      1.796810e+05
max      3.698690e+07
Name: Value, dtype: float64
import seaborn as sns
# Create a boxplot of the 'Value' column for each unique combination of the six columns
sns.boxplot(x='Year', y='Value', hue='KPI', data=df)
Handle the outliers:
In this particular case, the Tukey (IQR) method will be applied to identify and remove outliers.
# Create a copy of the dataframe
df_copy = df.copy()
# Create a boxplot to visualize the distribution of values
plt.boxplot(df_copy['Value'])
plt.title('Boxplot of Value')
plt.show()
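The code applying the Tukey method is not shown above, only the boxplot. A minimal sketch of the Tukey (1.5 × IQR) filter, using a small hypothetical series in place of df_copy['Value'], could look like this:

```python
import pandas as pd

# Hypothetical values standing in for df_copy['Value']; 500 is an obvious outlier
values = pd.Series([10, 12, 11, 13, 9, 14, 500])

# Tukey fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is treated as an outlier
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the fences
filtered = values[(values >= lower) & (values <= upper)]
print(filtered.tolist())  # [10, 12, 11, 13, 9, 14] -- 500 is dropped
```

On the real data the same fences would be computed from df_copy['Value'] and the filter applied before re-plotting the boxplot.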
● Transpose the data into a view like the reference given.
Approach :
Using the pandas library, we read the Excel file 'assignment.xlsx' and load the 'Funnel' sheet into a Pandas DataFrame called df.
The DataFrame is then melted with the melt() function so that the 'KPI' column, which contains the levels, becomes rows. The level number is extracted from the 'KPI' column with str.extract() and stored in a new 'Level' column. Next, a new DataFrame called pivoted_df is created by pivoting the melted DataFrame with pivot_table(), so that there is one row for each unique combination of Year, Month, Segment, Region, and Value Type, and one column per visitor level. The columns of pivoted_df are renamed to include the level number between the prefix "Lv" and the suffix "Visitors". Finally, the index of pivoted_df is reset to turn Year, Month, Segment, Region, and Value Type back into columns, producing a new DataFrame called transpose_df, and head() previews its first five rows.
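The code cell for these steps is not included above; a minimal sketch under the same column names, using a tiny hypothetical DataFrame in place of the Drive file, could look like this:

```python
import pandas as pd

# Hypothetical miniature of the 'Funnel' sheet (the real file lives on Google Drive)
df = pd.DataFrame({
    'Year': [2020, 2020, 2020, 2020],
    'Month': [1, 1, 1, 1],
    'Segment': ['Clients'] * 4,
    'Region': ['India'] * 4,
    'KPI': ['Lv1_Visitors', 'Lv2_Visitors', 'Lv3_Visitors', 'Lv4_Visitors'],
    'Value Type': ['Actuals'] * 4,
    'Value': [100, 80, 50, 20],
})

# Extract the level number from the KPI column (e.g. 'Lv3_Visitors' -> '3')
df['Level'] = df['KPI'].str.extract(r'Lv(\d)', expand=False)

# Pivot so each level becomes its own column
pivoted_df = df.pivot_table(
    index=['Year', 'Month', 'Segment', 'Region', 'Value Type'],
    columns='Level', values='Value', aggfunc='sum')

# Rename the level columns back to 'Lv<k>_Visitors' and flatten the index
pivoted_df.columns = [f'Lv{c}_Visitors' for c in pivoted_df.columns]
transpose_df = pivoted_df.reset_index()
print(transpose_df.head())
```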
Assume you are a data analyst at Fittlyf. Analyze the data for the region performing worst across all the years and prescribe what could be the reason and how to improve the number of visitors from that region.
Solution
To analyze the data for the region performing worst in all the years, we can follow these steps:
By summing the columns Lv1_Visitors to Lv5_Visitors for each row, we can calculate the total number of visitors from each region.
We can then group the data by region and compute each region's total visitor count across all years.
Sorting the data in ascending order identifies the region with the fewest visitors.
Once that region has been determined, we can analyze the data further to find the causes of the poor performance. This could involve considering factors such as the segment type, the value type, the season, and any localized marketing or promotional initiatives.
# By summing the columns Lv1_Visitors to Lv5_Visitors for each row, we calculate the total number of visitors from each region.
# Select relevant columns (take a copy so the assignment below does not trigger a SettingWithCopyWarning)
region_df = transpose_df[['Year', 'Month', 'Region', 'Lv1_Visitors', 'Lv2_Visitors', 'Lv3_Visitors', 'Lv4_Visitors', 'Lv5_Visitors']].copy()
# Calculate total number of visitors from each region
region_df['Total_Visitors'] = region_df['Lv1_Visitors'] + region_df['Lv2_Visitors'] + region_df['Lv3_Visitors'] + region_df['Lv4_Visitors'] + region_df['Lv5_Visitors']
# Group data by region and calculate total number of visitors for each region across all years
region_visitors = region_df.groupby('Region')['Total_Visitors'].sum().reset_index()
# Sort data in ascending order to identify region with lowest number of visitors
worst_region = region_visitors.sort_values('Total_Visitors').iloc[0]['Region']
worst_region
Solution:
The data analysis reveals that Aurangabad has performed worst across all years: this region consistently receives fewer visitors than other locations.
To understand the cause of the poor performance, we need to examine the different visitor types and demographic groups in this region. From the data, we can observe that level 1 and level 2 visitors make up the majority of visitors in Aurangabad. At the higher levels there are fewer visitors overall, with level 5 seeing the lowest number.
Possible reasons for this low performance could be a lack of awareness or marketing efforts targeted at higher-level visitors, a lack of infrastructure and amenities to cater to higher-level visitors, or a general lack of interest in health and fitness in this region.
To increase the number of visitors from Aurangabad, the business should concentrate on targeted marketing and awareness activities to attract higher-level visitors. It could also invest in improving its facilities and services to meet the needs of higher-level visitors. Finally, it could conduct research and analysis to understand the local market and adjust its product offerings accordingly.
Based on the given data, identify which region is having a better YearOnYear growth.
import pandas as pd
# Load the data into a dataframe
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/assignment.xlsx', sheet_name='Funnel')
# Group the data by Region and Year, and calculate the sum of Value for each group
grouped_df = df.groupby(['Region', 'Year'])['Value'].sum().reset_index()
# Calculate the YoY growth rate for each region
grouped_df['YoY Growth Rate'] = grouped_df.groupby('Region')['Value'].pct_change()
# Get the region with the highest YoY growth rate
best_region = grouped_df.sort_values(by='YoY Growth Rate', ascending=False)['Region'].iloc[0]
print("The region with the better YearOnYear growth is:", best_region)
The region with the better YearOnYear growth is: Dehradun
● Guess what could be the meaning of Level 1 visitors, Level 2 visitors, up to Level 5, and then answer the following:
Solution 1:
Based on the naming convention, it is likely that Level 1 through Level 5 visitors refer to distinct phases or levels of website visitors, with each level standing for a specific online activity or behaviour.
For instance, Level 1 visitors might be those who first arrived at the website, Level 2 visitors might be those who browsed particular product categories, Level 3 visitors might be those who clicked on a particular product, Level 4 visitors might be those who added items to their cart, and Level 5 visitors might be those who finished the checkout process.
Solution:
To create the new feature (Level 5 visitors / Level 1 visitors), we must extract the data from the KPI column, which contains the levels in the form Lv1_Visitors to Lv5_Visitors. We can accomplish this with regular expressions and then carry out the calculation.
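The code cell producing the output below is not shown; a minimal sketch of the approach, using a tiny hypothetical DataFrame in place of the real sheet, could look like this:

```python
import pandas as pd

# Hypothetical miniature of the dataset (the real sheet lives on Google Drive)
df = pd.DataFrame({
    'Year': [2020] * 4,
    'Segment': ['Customers'] * 4,
    'Region': ['Faridabad', 'Faridabad', 'Ujjain', 'Ujjain'],
    'KPI': ['Lv1_Visitors', 'Lv5_Visitors'] * 2,
    'Value': [1000, 50, 800, 60],
})

# Extract the level number from the KPI string with a regular expression
df['Level'] = df['KPI'].str.extract(r'Lv(\d)', expand=False).astype(int)

# Pivot levels into columns, then compute the Lv5/Lv1 ratio per region
wide = df.pivot_table(index=['Year', 'Segment', 'Region'],
                      columns='Level', values='Value', aggfunc='sum')
wide['Lv5_to_Lv1'] = wide[5] / wide[1]

# Rank regions by the new feature (top rows = best funnel completion)
top = wide['Lv5_to_Lv1'].sort_values(ascending=False)
print(top)
```

On the real data, taking head(3) of this ranking within each (Year, Segment) group yields the top-3 lists shown below.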
Year: 2020 , Segment: Clients
0    India
Name: Region, dtype: object
Year: 2020 , Segment: Customers
3    Faridabad
5    Uddepy
6    Ujjain
Name: Region, dtype: object
Year: 2021 , Segment: Clients
7    India
Name: Region, dtype: object
Year: 2021 , Segment: Customers
10    Faridabad
8     Aurangabad
13    Ujjain
Name: Region, dtype: object
Year: 2022 , Segment: Clients
16    Indore
14    Dehradun
18    Ujjain
Name: Region, dtype: object
Year: 2022 , Segment: Customers
23    Uddepy
19    Aurangabad
24    Ujjain
Name: Region, dtype: object
The above output tells us the top 3 states for each year and every Segment
Solution 2:
In this case we will develop a new metric: the percentage change in Value from the first month to the last month of each year, for each segment and region.
import pandas as pd
# Load the data
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/assignment.xlsx', sheet_name='Funnel')
# Calculate the percentage change in value from first month to last month for each segment and region
df['YearMonth'] = df['Year'].astype(str) + '-' + df['Month'].astype(str).str.zfill(2)
df['FirstMonth'] = df.groupby(['Segment', 'Region', 'Year'])['Value'].transform('first')
df['LastMonth'] = df.groupby(['Segment', 'Region', 'Year'])['Value'].transform('last')
df['Change'] = ((df['LastMonth'] - df['FirstMonth']) / df['FirstMonth']) * 100
# Pivot the data to have a column for each segment and a row for each region and year
df_pivot = df.pivot_table(index=['Region', 'Year'], columns='Segment', values='Change')
# Print the top 3 states for each year and segment based on the new metric
for year in df['Year'].unique():
    for segment in df['Segment'].unique():
        print(f'Top 3 states for {year} - {segment}')
        segment_data = df_pivot[segment].loc[:, year].dropna()
        segment_data_sorted = segment_data.sort_values(ascending=False)
        top_3_states = segment_data_sorted.head(3)
        print(top_3_states)
        print()
Top 3 states for 2020 - Clients
Region
India   -97.655773
Name: Clients, dtype: float64

Top 3 states for 2020 - Customers
Region
Uddepy       -83.740171
Aurangabad   -84.881839
Ujjain       -87.995754
Name: Customers, dtype: float64

Top 3 states for 2021 - Clients
Region
India   -97.84048
Name: Clients, dtype: float64

Top 3 states for 2021 - Customers
Region
Dehradun     -76.440596
Aurangabad   -77.763027
Ujjain       -78.872662
Name: Customers, dtype: float64

Top 3 states for 2022 - Clients
Region
Indore     -83.631722
Dehradun   -85.625803
Uddepy     -88.592267
Name: Clients, dtype: float64

Top 3 states for 2022 - Customers
Region
Uddepy       -91.099240
Aurangabad   -91.509586
Ujjain       -92.846620
Name: Customers, dtype: float64
The above output tells us the top 3 states for each year and every Segment
Write a function called predict_future('Region', 'Segment') which, when called, would perform the following activity:
!pip install prophet
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: prophet in /usr/local/lib/python3.10/dist-packages (1.1.2)
...
Successfully installed cmdstanpy-1.1.0
import pandas as pd
from prophet import Prophet
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/assignment.xlsx', sheet_name='Funnel')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1572 entries, 0 to 1571
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Year        1572 non-null   int64
 1   Month       1572 non-null   int64
 2   Segment     1572 non-null   object
 3   Region      1572 non-null   object
 4   KPI         1572 non-null   object
 5   Value Type  1572 non-null   object
 6   Value       1572 non-null   int64
dtypes: int64(3), object(4)
memory usage: 86.1+ KB
# Test the function with the Dehradun region and the Clients segment
result = predict_future('Dehradun', 'Clients')
print(result)
# Test the function with the Ujjain region and the Customers segment
result = predict_future('Ujjain', 'Customers')
print(result)
Generate the MAPE and RMSE of your predictions for the years 2020, 2021 & 2022 for the given parameters.
import pandas as pd
import numpy as np
from prophet import Prophet
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

def predict_future(region='Region1', segment='Segment1'):
    # Load the dataset
    df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/assignment.xlsx', sheet_name='Funnel')
    # Filter the dataset based on the provided Region and Segment
    df = df[(df['Region'] == region) & (df['Segment'] == segment)]
    # Remove any rows with NaN/Null values in the 'Value' column
    df = df.dropna(subset=['Value'])
    # Convert the 'Year' and 'Month' columns into a datetime column 'ds'
    df['ds'] = pd.to_datetime(df['Year'].astype(str) + '-' + df['Month'].astype(str))
    # Use the 'Year' and 'Month' columns as the index
    df = df.set_index(['Year', 'Month'])
    # Rename the 'Value' column to 'y' (Prophet expects 'ds' and 'y' columns)
    df = df.rename(columns={'Value': 'y'})
    # Create a new Prophet model and fit it to the data
    model = Prophet()
    model.fit(df)
    # Use make_future_dataframe() to create a DataFrame covering the next 3 years
    future = model.make_future_dataframe(periods=36, freq='M')
    # Use predict() on the fitted model to forecast values for the next 3 years
    forecast = model.predict(future)
    # Filter the predictions for the years 2020, 2021, and 2022
    predictions = forecast[forecast['ds'].dt.year.isin([2020, 2021, 2022])][['ds', 'yhat']].reset_index(drop=True)
    # Collect the actual values for the years 2020, 2021, and 2022
    actual = df[df.index.get_level_values('Year').isin([2020, 2021, 2022])]['y'].reset_index(drop=True)
    # Calculate the MAPE and RMSE for the predictions against the actual values
    mape = np.mean(np.abs((actual - predictions['yhat']) / actual)) * 100
    rmse = np.sqrt(np.mean((actual - predictions['yhat']) ** 2))
    # Return the predictions, actual values, and the MAPE and RMSE
    return {'Predictions': predictions, 'Actual': actual, 'MAPE': mape, 'RMSE': rmse}
result = predict_future(region='Aurangabad', segment='Customers')
print(result['MAPE']) # Prints the MAPE
print(result['RMSE']) # Prints the RMSE
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
...
INFO:cmdstanpy:Chain [1] done processing
235.43431334790165
2917.182063764239
Plot a line graph of the level 5 actual numbers from 2020-2022 & in the same graph, there should be the predicted numbers for 2023. The x-axis should be the timeline from 2020 Jan to 2023 Jun and the y-axis should be the value of the level 5 column and predicted values.
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: ipywidgets in /usr/local/lib/python3.10/dist-packages (7.7.1)
...
Successfully installed jedi-0.18.2
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: OK
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

def predict_future(region='India', segment='Clients'):
    # Load the dataset
    df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/assignment.xlsx', sheet_name='Funnel')
    # Filter on the requested Region and Segment, keeping only the Level 5 KPI
    # (the name 'Lv5_Visitors' follows the Lv1_Visitors ... Lv4_Visitors pattern seen in the sheet)
    df = df[(df['Region'] == region) & (df['Segment'] == segment) & (df['KPI'] == 'Lv5_Visitors')]
    # Remove any rows with NaN/Null values in the 'Value' column
    df = df.dropna(subset=['Value'])
    # Combine the 'Year' and 'Month' columns into the 'ds' datetime column Prophet expects
    df['ds'] = pd.to_datetime(df['Year'].astype(str) + '-' + df['Month'].astype(str))
    # Rename the 'Value' column to 'y' and keep only the columns Prophet needs, in time order
    df = df.rename(columns={'Value': 'y'})[['ds', 'y']].sort_values('ds')
    # Create a new Prophet model and fit it to the data
    model = Prophet()
    model.fit(df)
    # Use `make_future_dataframe()` to build a date range covering the next 18 months
    future = model.make_future_dataframe(periods=18, freq='M')
    # Use `predict()` on the fitted model to forecast the Level 5 values over that range
    forecast = model.predict(future)
    # Plot the actual and predicted values with matplotlib
    plt.plot(df['ds'], df['y'], label='Actual')
    plt.plot(forecast['ds'], forecast['yhat'], label='Predicted')
    # Set the labels for the x and y axes
    plt.xlabel('Timeline')
    plt.ylabel('Level 5 Visitors')
    # Set the title for the graph
    plt.title('Actual vs Predicted Level 5 Visitors')
    # Add a legend to the graph
    plt.legend()
    # Display the graph
    plt.show()
    # Return the predicted values for the next 18 months
    return forecast[['ds', 'yhat']].tail(18)
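The Year/Month-to-timestamp step above can be sketched standalone with a small hypothetical frame (the values below are illustrative, mimicking the Funnel sheet's layout) to show the `ds`/`y` shape Prophet expects:

```python
import pandas as pd

# Hypothetical mini-frame mimicking the Funnel sheet's Year/Month/Value layout.
df = pd.DataFrame({
    'Year':  [2020, 2020, 2021],
    'Month': [11, 12, 1],
    'Value': [3665558, 2689569, 1300571],
})

# Build the 'ds' timestamp column Prophet expects (day defaults to the 1st)
# and rename 'Value' to 'y'.
df['ds'] = pd.to_datetime(df['Year'].astype(str) + '-' + df['Month'].astype(str))
prophet_df = df.rename(columns={'Value': 'y'})[['ds', 'y']]
print(prophet_df)
```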
Using the "AB_TEST" sheet in the shared Excel file, what are the possible metrics you can create for A/B testing, excluding the number of clicks and the number of visitors?
The metrics that can be created for A/B testing, excluding the number of clicks and the number of visitors, are as follows:
Conversion rate: This is the proportion of visitors who complete an intended action, such as making a purchase or filling out a form. It can be estimated by dividing the number of conversions by the number of visits.
Revenue per user: This metric captures the average amount of money earned per user. It can be computed by dividing the total revenue by the number of users.
Engagement rate: This is an indicator of how actively users interact with a website. It can be computed by dividing the number of active users by the total number of visitors.
Bounce rate: This statistic shows the proportion of site visitors who leave after viewing just one page. It can be computed by dividing the number of one-page visits by the total number of visits.
Average session duration: This indicator examines how long visitors typically stay on a website. It can be computed by dividing the total time spent on the site by the number of sessions.
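The metrics above can be computed with a few lines of pandas. The session-level frame below is hypothetical (its column names and values are assumptions, not taken from the AB_TEST sheet); it only illustrates the formulas:

```python
import pandas as pd

# Hypothetical session-level data; columns and values are illustrative.
sessions = pd.DataFrame({
    'user_id':   [1, 2, 3, 4, 5],
    'pages':     [1, 4, 2, 1, 6],         # pages viewed in the session
    'duration':  [30, 240, 95, 10, 400],  # seconds spent on site
    'converted': [0, 1, 0, 0, 1],         # 1 = completed the intended action
    'revenue':   [0.0, 49.0, 0.0, 0.0, 99.0],
})

n = len(sessions)
conversion_rate = sessions['converted'].sum() / n        # conversions / visits
revenue_per_user = sessions['revenue'].sum() / n         # total revenue / users
bounce_rate = (sessions['pages'] == 1).sum() / n         # one-page visits / visits
avg_session_duration = sessions['duration'].sum() / n    # total time / sessions

print(conversion_rate, revenue_per_user, bounce_rate, avg_session_duration)
# → 0.4 29.6 0.4 155.0
```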
Perform an A/B test to determine which variation, control or treatment, performs better.
Solution :
To perform AB testing, we can define the null hypothesis as "there is no difference in the conversion rate between the control and treatment groups" and the alternative hypothesis as "there is a statistically significant difference in the conversion rate between the control and treatment groups".
A/B testing compares the success rates of two variations to identify which performs better. Here, we use the given click and visitor counts to decide whether the control or the treatment variation is more effective.
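One common way to carry out such a test is a chi-square test of independence on a 2x2 contingency table of clicks vs. non-clicks per variation. The counts below are made up for illustration (the real AB_TEST sheet numbers differ), and the 0.05 significance threshold is a conventional choice:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts; replace with the values from the AB_TEST sheet.
control_visitors, control_clicks = 10000, 1200
treatment_visitors, treatment_clicks = 10000, 1050

# 2x2 table: [clicks, non-clicks] for each variation.
table = np.array([
    [control_clicks,   control_visitors - control_clicks],
    [treatment_clicks, treatment_visitors - treatment_clicks],
])
chi2, p_value, dof, _ = chi2_contingency(table)

control_rate = control_clicks / control_visitors
treatment_rate = treatment_clicks / treatment_visitors
if p_value < 0.05:
    winner = 'Control' if control_rate > treatment_rate else 'Treatment'
    print(f'p = {p_value:.4g}; {winner} is the winning variation')
else:
    print(f'p = {p_value:.4g}; no significant difference')
```

With these made-up counts the control click rate (12%) beats the treatment's (10.5%) at a p-value well below 0.05, mirroring the structure of the conclusion reported below.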
Clicks p-value: 2.4230205257275484e-20
Visitors p-value: 1.2501424893255777e-17
Control is the winning variation